1. Lots and Lots of Space
All the storage services can take huge amounts of data. People
have been known to store billions of rows in one table in the table
service, and to store terabytes of data in the blob service. However,
there is not an infinite amount of space available, since Microsoft owns
only so many disks. But this is not a limit that you or anyone else will
hit.
You could draw a parallel to IPv6, which was designed as a
replacement for the current system of IP addresses, IPv4. The fact is
that IPv4 is running out of IP addresses to hand out—the limit is 4.3
billion, and the number of available IP addresses is quickly
diminishing. IPv6 has a finite number of addresses as well, but the
limit will never be reached, since that finite number is incredibly
large: 3.4 × 10³⁸, or 34 followed by 37
zeros!
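That IPv6 figure falls straight out of its 128-bit address size. A quick computation (Python here purely for illustration) shows where 3.4 × 10³⁸ comes from:

```python
# IPv6 addresses are 128 bits wide, so the address space is 2^128.
ipv6_addresses = 2 ** 128
print(ipv6_addresses)           # 340282366920938463463374607431768211456
print(f"{ipv6_addresses:.1e}")  # 3.4e+38

# IPv4, by contrast, uses 32-bit addresses: about 4.3 billion in total.
ipv4_addresses = 2 ** 32
print(f"{ipv4_addresses:,}")    # 4,294,967,296
```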
The same holds true for these storage services. You need not worry
about running out of space on a file share, negotiating with vendors, or
racking up new disks. You pay only for what you use, and you can be sure
that you’ll always have more space if you need it. Note that this
represents the total data you can store. There are limits on how large
an individual blob or a table entity/row can be, but not on the number
of blobs or the number of rows you can have.
2. Distribution
All the storage services are massively distributed. This means
that, instead of having a few huge machines serving out your data, your
data is spread out over several smaller machines. These smaller machines
do have higher rates of failure than specialized storage infrastructure
such as storage area networks (SANs), but Windows Azure storage deals
with failure through software. It implements various distributed
software techniques to ensure that it stays available and reliable in
the presence of failures in one or more of the machines on which it
runs.
3. Scalability
All the storage services are scalable. However,
scalable is a loaded, often abused word. In this
context, it means your performance should stay the same, regardless of
the amount of data you have. (This statement comes with some caveats, of
course. When you learn about the table service, you’ll see how you can
influence performance through partitioning.) More importantly,
performance stays the same when load increases. If your site shows up on
the home page of Slashdot or Digg or Reddit, Windows Azure does magic
behind the scenes to ensure that the time taken to serve requests stays
the same. A single commodity machine can take only so much load, so
multiple mechanisms are at play behind the scenes, from making
multiple copies of the data to maintaining multiple levels of hot-data
caching.
4. Replication
All data is replicated multiple times. In case of a hardware
failure or data corruption in any of the replicas, there are always more
copies from which to recover data. This happens under the covers, so you
don’t need to worry about this explicitly.
5. Consistency
Several distributed storage services are eventually consistent, which means that after an
operation is performed, it may take some time (usually a few seconds)
before the data you retrieve reflects that change. Eventual consistency
usually means better scalability and performance: if you don’t need to
make changes on several nodes, you have better availability. The
downside is that it makes writing code a lot trickier, because it’s
possible to have a write operation followed by a read that doesn’t see
the results of the write you just performed.
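That write-then-stale-read anomaly is easier to see in code. Here is a toy in-memory simulation (not how any real store is implemented) where writes land on one replica and reads come from another, with a propagation delay in between:

```python
import time

class EventuallyConsistentStore:
    """Toy model: writes land on replica 0 and propagate to replica 1
    only after a delay, so a read right after a write can be stale."""
    def __init__(self, propagation_delay=0.05):
        self.replicas = [{}, {}]
        self.pending = []          # (apply_at_time, key, value)
        self.delay = propagation_delay

    def write(self, key, value):
        self.replicas[0][key] = value  # fast local write
        self.pending.append((time.time() + self.delay, key, value))

    def read(self, key):
        # Apply any pending writes whose propagation delay has elapsed.
        now = time.time()
        still_pending = []
        for apply_at, k, v in self.pending:
            if apply_at <= now:
                self.replicas[1][k] = v
            else:
                still_pending.append((apply_at, k, v))
        self.pending = still_pending
        return self.replicas[1].get(key)  # reads hit replica 1

store = EventuallyConsistentStore()
store.write("cart", ["book"])
print(store.read("cart"))      # None -- the write hasn't propagated yet
time.sleep(0.1)
print(store.read("cart"))      # ['book'] -- eventually consistent
```

The stale `None` on the first read is exactly the behavior your application code would have to tolerate against an eventually consistent store.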
Note: Don’t misinterpret this description—eventual consistency is
great in specialized scenarios. For example, Amazon’s shopping cart
service is a canonical example of an eventually consistent
application. The underlying store it writes to (Dynamo) is a
state-of-the-art distributed storage system. It lets Amazon trade
read freshness for insert/add performance: reads don't always reflect
changes instantly, but in exchange, items can be added to shopping
carts almost instantly, and, more importantly, an added item is never
lost. Amazon decided that never losing shopping cart items was worth
the trade-off of a minuscule percentage of users' shopping carts not
appearing to have all their items at all times. For more
information, read Amazon’s paper on the topic at http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf.
Windows Azure storage is not eventually consistent; it is
instantly/strongly consistent. This means that when you do an update or a
delete, the changes are instantly visible to all future API calls. The
team decided to do this because eventual consistency would have made
writing code against the storage services quite tricky, and, more
importantly, because they could achieve very good performance without
it. While full database-style transactions aren't available, a limited
form is: you can batch calls against a single partition.
Application code must ensure consistency across partitions, and across
different storage account calls.
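The one-partition restriction shapes how batching code is written: before submitting anything, you split a mixed list of operations into one batch per partition key. A minimal sketch of that bookkeeping follows; the commented-out `submit_batch` call is a hypothetical stand-in for an actual table-service request, not a real API:

```python
from collections import defaultdict

def group_by_partition(operations):
    """Table-service batches may only touch a single partition, so
    split a mixed list of operations into one batch per partition key."""
    batches = defaultdict(list)
    for op in operations:
        batches[op["PartitionKey"]].append(op)
    return dict(batches)

operations = [
    {"PartitionKey": "customers-US", "RowKey": "1", "op": "insert"},
    {"PartitionKey": "customers-EU", "RowKey": "7", "op": "update"},
    {"PartitionKey": "customers-US", "RowKey": "2", "op": "insert"},
]

for partition_key, batch in group_by_partition(operations).items():
    # submit_batch(partition_key, batch)  # hypothetical call: each
    # single-partition batch succeeds or fails as a unit; consistency
    # across partitions remains the application's job.
    print(partition_key, len(batch))
```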
6. RESTful HTTP APIs
All the storage services are exposed through a RESTful HTTP API.
You’ll learn about the building blocks of these APIs later in this
chapter. All APIs can be accessed from both inside and outside Microsoft
data centers. This is a big deal: it means
you could host your website or service in your current data center, and
pick and choose what service you want to use. For example, you could use
only blob storage, or use only the queue service, instead of having to
host your code inside Windows Azure as well. This is similar to how
several websites use Amazon’s S3 service. For example, Twitter runs code
in its own infrastructure, but uses S3 to store profile images.
Another advantage of having open RESTful APIs is that it is
trivial to build a client library in any language/platform. Microsoft
ships one in .NET, but there are bindings in Python, Ruby, and Erlang,
just to name a few. Later in this chapter, you will learn how to build a
rudimentary library to illustrate the fundamental concepts, but in most
of the storage code, you’ll be using the official Microsoft client
library. If you want to implement your own library in a different
language/environment, just work through the sample in this chapter;
you should find it easy to repeat the same steps in your chosen
environment.
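To get a feel for how thin that layer is, here is a sketch of assembling a GET request for a blob using nothing but the standard library. The account, container, and blob names are placeholders; the `x-ms-version` value is one of the early storage API versions, and private data would additionally need a SharedKey `Authorization` header (not shown):

```python
from email.utils import formatdate

def build_blob_get_request(account, container, blob):
    """Assemble the URL and headers for a GET on a blob. For a blob in
    a public container, this unsigned request is all you need; private
    data also requires a SharedKey Authorization header (not shown)."""
    url = f"http://{account}.blob.core.windows.net/{container}/{blob}"
    headers = {
        "x-ms-version": "2009-09-19",          # storage API version
        "x-ms-date": formatdate(usegmt=True),  # RFC 1123 timestamp
    }
    return url, headers

url, headers = build_blob_get_request("myaccount", "pics", "profile.jpg")
print(url)  # http://myaccount.blob.core.windows.net/pics/profile.jpg
```

Actually sending the request is one `http.client` or `urllib` call away; the point is that any language with an HTTP stack can speak this protocol.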
7. Geodistribution
When you create your storage account, you can pick in which
geographical location you want your data to reside. This is great not
only for keeping your data close to your code or your customers, but
also for spreading your data out geographically: you don't want a
natural disaster in one region to take out your only copy of some
valuable data.
8. Pay for Play
With Windows Azure storage, like the rest of Windows Azure, you
pay only for the storage you currently use, and for bandwidth
transferring data in and out of the system.
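A back-of-the-envelope bill makes the model concrete. The per-GB rates below are assumed, purely illustrative numbers (check current pricing); the shape of the calculation is the point:

```python
# Back-of-the-envelope monthly bill. Both rates are assumptions for
# illustration only, not actual Windows Azure prices.
GB_STORED_RATE = 0.15    # $/GB stored per month (assumed)
GB_TRANSFER_RATE = 0.10  # $/GB transferred (assumed)

gb_stored = 50
gb_transferred = 20

bill = gb_stored * GB_STORED_RATE + gb_transferred * GB_TRANSFER_RATE
print(f"${bill:.2f}")  # $9.50 -- and $0.00 in a month you store nothing
```

There is no fixed fee in this model: delete your data and stop transferring, and the bill drops to zero.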